Efficient Spatial Keyword Search in Trajectory Databases 



Gao Cong †  Hua Lu §  Beng Chin Ooi *  Dongxiang Zhang *  Meihui Zhang *

† School of Computer Engineering, Nanyang Technological University
§ Department of Computer Science, Aalborg University
* School of Computing, National University of Singapore

gaocong@ntu.edu.sg  luhua@cs.aau.dk  {ooibc | mhzhang | dxzhang}@comp.nus.edu.sg



ABSTRACT 

An increasing amount of trajectory data is being annotated with text
descriptions to better capture the semantics associated with locations.
The fusion of spatial locations and text descriptions in trajectories
engenders a new type of top-k query that takes both aspects into account.
Each trajectory under consideration consists of a sequence of geo-spatial
locations associated with text descriptions. Given a user location
$\lambda$ and a keyword set $\psi$, a top-k query returns the k
trajectories whose text descriptions cover the keywords $\psi$ and that
have the shortest match distance. To the best of our knowledge, previous
research on querying trajectory databases has focused on trajectory data
without any text description, and no existing work has studied this kind
of top-k query on trajectories. This paper proposes a novel method for
efficiently computing top-k trajectories. The method is based on a new
hybrid index, the cell-keyword conscious B+-tree, denoted by B^ck-tree,
which enables us to exploit both text relevance and location proximity to
facilitate efficient and effective query processing. The results of our
extensive empirical studies with an implementation of the proposed
algorithms on BerkeleyDB demonstrate that our proposed methods are
capable of achieving excellent performance and good scalability.

1. INTRODUCTION 

With the increasing popularity of crowdsourcing, as well as the
advancement and miniaturization of handheld devices with GPS receivers,
massive amounts of data that are geo-tagged or associated with text
information are being generated at an unprecedented scale. For example,
crowdsourcing of motion trajectories is applied to generate open map
systems (e.g., openstreetmap.org and waze.com).

Users have crowdsourced huge volumes of trajectory data that 
are annotated with keywords or text descriptions. In such datasets, 
a trajectory is composed of a sequence of places and line segments
connecting these places. The places in a trajectory, captured as spatial
locations, are often associated with text descriptions. Figure 1 shows an
example of a trajectory. Such trajectories come from various sources, and
we name just a few in the following: 1) In many GPS-trajectory-sharing
websites (e.g., Mountain Bike: www.bikely.com, GPS sharing:
www.gpssharing.com, GPSies: www.gpsies.com, and Geolife [31]), people
upload their travel routes. To record their journeys or share life
experiences with others, they often attach texts and multimedia content
(e.g., photos) as annotations to the places in their trajectories. 2) In
location-based social network services (e.g., Foursquare), each place is
associated with tags and users can check in at such places. The check-in
sequence of a user in a period forms a trajectory. The places can be
points of interest of any kind, e.g., restaurants and shops, and thus the
trajectories can be of various types, such as travel trajectories and 
daily life trajectories. 3) Trajectories with text descriptions can be 
extracted from travel itineraries 11161 , as well as Flickr photos 1191 . 

Such publicly accessible datasets serve as an informative repos- 
itory to users. A user may want to find others' travel routes that 
are relevant to his/her interests and that have a short travel distance. 
Motivated as such, we consider queries that search previously ex- 
plored routes of places that satisfy a user's interests or needs, ex- 
pressed as a set of keywords, and that may also lead to the shortest 
total traveling distance. The results of such a query exploit the col- 
lective intelligence of crowdsourcing. 

In addition, users may be interested in learning the daily life ex- 
perience of others. For example, from relevant social network ap- 
plications, it is easy to derive a shopping trajectory database, where 
each place (corresponding to a shop) in a user-generated trajectory 
is associated with the items bought by the user at that place. Such 
a user-generated trajectory indicates the user's preferences. Suppose
that a user has a shopping list of product names. She would like to see
the routes of other users who bought all the items on the list, such that
the traveling distance from her starting location along the route is
minimal.

The fusion of spatial locations and text descriptions in trajecto- 
ries demands efficient processing of queries that involve both at- 
tributes. Indeed, the aforementioned GPS sharing websites already 
support a type of query related to both text and locations, namely
keyword range queries, to help users share, browse, and search GPS
trajectories. They allow users to specify a region and a set of keywords,
and return the trajectories that are inside the query region and contain
the set of query keywords. However, the algorithms used are not
published, and these websites answer such queries very slowly.

Existing research on querying trajectory databases has focused
on trajectory data without any text description. For example, a
k-Nearest Neighbor query [13] returns the k nearest moving object
trajectories to a given query point based on the minimum distance 
from the query point to a trajectory. Querying trajectory data is 
time consuming and therefore, indexes such as the R-tree and its 
optimized versions for trajectories have been used. 

To the best of our knowledge, no publication considers querying
trajectories that are composed of a sequence of geo-locations
associated with text descriptions.

In this paper, we introduce a new problem: the top-k spatial keyword
query (TkSK) on trajectories. Given a large database of trajectories, a
TkSK query consists of a spatial element (query location) and a set of
keywords, and it returns the top-k trajectories with the shortest match
distance. The match distance is measured by the sum of two distances: the
length of a sub-trajectory covering all query keywords, and the distance
from the query location to the start location of the sub-trajectory. It
is a challenge to efficiently answer the TkSK query on trajectories
associated with text.

To this end, we propose a novel solution with the following features.
First, we develop a new index for trajectories, called the cell-keyword
conscious B+-tree, denoted by B^ck-tree. The B^ck-tree integrates spatial
information, captured by location keys generated by adaptive cells, with
text information, such that it enables simultaneous application of both
spatial proximity and keyword matching in query processing. The B^ck-tree
is efficient for queries as well as updates, and it is adaptive to
varying workloads. Further, because it uses the B+-tree, which is
available in all mainstream DBMSs, our proposed solution can be easily
grafted onto existing database systems. Second, based on the B^ck-tree,
we develop an algorithm for choosing candidate trajectories that are
close to the query location and contain the query keywords, and thus are
more likely to be the results of a TkSK query. Third, we propose a
linear-time algorithm, called Match, for efficiently computing the match
distance between a query and a candidate trajectory, which contrasts with
a straightforward method that takes quadratic time.

Since no baseline algorithms exist for processing TkSK queries, we also
develop four baseline algorithms. They all use the proposed algorithm
Match for computing the match distance. They differ in their ways of
finding candidate trajectories: 1) The first one uses the inverted list
index to choose the trajectories containing all query words. 2) The
second uses the R-tree to retrieve nearby trajectories. 3) The third is
based on the IR-tree [10], treating each trajectory as a whole to
retrieve nearby trajectories containing query keywords. 4) The fourth
extends the TB-tree [22], an existing index for trajectories, to
incorporate the text information organized in an inverted index, and uses
the extended TB-tree to retrieve candidate trajectories.

In summary, the paper's contributions are threefold. First, we introduce
and formalize a new type of query on trajectory data that are associated
with words. Second, we propose a novel solution for efficiently
processing TkSK queries. The proposed solution consists of a new index
structure, the B^ck-tree, for trajectories associated with words, an
approach to computing the minimum match distance between a trajectory and
a query, and a top-k query processing algorithm. The proposed solution
can be implemented on top of existing DBMSs cost-effectively. We also
explore other ways of answering TkSK queries as baseline methods. Third,
with an implementation of the B^ck-tree based algorithm on BerkeleyDB, we
conduct an extensive experimental study, which includes a comparison with
the four baselines. The experimental results demonstrate the efficiency
and scalability of our proposed solution.

The rest of the paper is organized as follows. Section 2 defines the
TkSK query. Section 3 presents the baseline algorithms. Section 4 details
our solution for processing the TkSK query. Section 5 reports the
experimental study. Section 6 reviews related work. Section 7 concludes
this paper.

2. PROBLEM STATEMENT 

In this section, we give the problem statement and provide nec- 
essary definitions and background. 
Figure 1: Example



Data. Let $\mathcal{D}$ be a dataset in which each object is a trajectory.

Definition 1: Trajectory

Each trajectory $TR \in \mathcal{D}$ is defined as a sequence of places
(points of interest) $TR = L_1, \dots, L_i, \dots, L_n$. □

Each place $L$ is represented by a pair $(L.\lambda, L.\psi)$, where
$L.\lambda$ represents a geo-spatial point location and $L.\psi$ denotes
a set of keywords (e.g., the description of the place).

We denote the union of the text descriptions of the places in trajectory
$TR$ by $TR.\psi = \bigcup_{i=1}^{n} TR.L_i.\psi$.

Definition 2: Sub-Trajectory and Contain

We define a sub-trajectory as a subsequence from place $s$ to place $e$
of trajectory $TR$, denoted by $TR.L_s^e$, with $s, e \in [1, n]$ and
$s \le e$. Given two sub-trajectories $TR.L_{s_1}^{e_1}$ and
$TR.L_{s_2}^{e_2}$, we say that $TR.L_{s_1}^{e_1}$ contains
$TR.L_{s_2}^{e_2}$ if $s_1 \le s_2$ and $e_1 \ge e_2$. □

We denote the union of the text descriptions of the places in
sub-trajectory $TR.L_s^e$ by
$TR.L_s^e.\psi = \bigcup_{i=s}^{e} TR.L_i.\psi$.

Query. A spatial keyword query $q = \langle \lambda, \psi \rangle$ has
two components, where $q.\lambda$ is a spatial location and $q.\psi$ is a
set of keywords. The location descriptor $q.\lambda$ specifies the
location preference of a user, and $q.\psi$ indicates the preference of a
user on the keywords of objects.

Definition 3: Match

We say that a trajectory $TR$ matches a query $q$ if the following
condition is satisfied.

$$q.\psi \subseteq TR.\psi$$

Similarly, we say that a sub-trajectory $TR.L_s^e$ matches a query $q$ if
$q.\psi \subseteq TR.L_s^e.\psi$.

Intuitively, we say that a trajectory $TR$ matches a query $q$ if all
the keywords of the query are contained in the text of the trajectory.

Definition 4: Minimum Match 

We say that a sub-trajectory $TR.L_s^e$ is a minimum match of a query
$q$ if (1) $TR.L_s^e$ matches $q$; and (2) no proper sub-trajectory of
$TR.L_s^e$ matches $q$. □

Example 1: Refer to Figure 1. A traveller wants to find a route in
which she can see waterfall, panda and kiosk. There are two minimum
matches to the query {waterfall, meadow, kiosk}: $L_2 \to L_3$ and
$L_3 \to L_4$. Note that $L_2 \to L_3 \to L_4$ is a match but it is not
a minimum match. □

Definition 5: Match Distance 

If a sub-trajectory $TR.L_s^e$ matches a query $q$, the match distance
$\mathrm{matchDist}(q, TR.L_s^e)$ is defined as follows:

$$\mathrm{matchDist}(q, TR.L_s^e) = \min\{\mathrm{Dist}(q, TR.L_s), \mathrm{Dist}(q, TR.L_e)\} + \sum_{i=s}^{e-1} \mathrm{Dist}(TR.L_i, TR.L_{i+1}),$$

where Dist(.,.) is the Euclidean distance between two locations.
If a sub-trajectory $TR.L_s^e$ does not match a query $q$, the match
distance is defined as $\infty$. □

Definition 6: Minimum Match Distance 

If a trajectory $TR$ matches a query $q$, the minimum match distance
$\mathrm{minMatchDist}(q, TR)$ is defined as follows:

$$\mathrm{minMatchDist}(q, TR) = \min_{(s,e)} \big(\mathrm{matchDist}(q, TR.L_s^e)\big),$$

s.t. $TR.L_s^e$ is a minimum match of $q$.

□ 

Example 2: Consider the example in Figure 1. When we take her
current location $q.\lambda$ into account, sub-trajectory $L_3 \to L_4$
is the one with the minimum match distance. □
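Definitions 3 to 6 can be made concrete with a small brute-force sketch. It is not the linear-time Match algorithm of Section 4.3; it simply enumerates all sub-trajectories, keeps the minimum matches, and evaluates the match distance. The function names, the 0-based indexes, and the sample places (coordinates and keywords) are assumptions made for illustration and are not taken from Figure 1.

```python
import math
from typing import List, Optional, Set, Tuple

Point = Tuple[float, float]
Place = Tuple[Point, Set[str]]  # (location, keywords), i.e. (L.lambda, L.psi)


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two locations."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def covers(q_words: Set[str], places: List[Place], s: int, e: int) -> bool:
    """True if the sub-trajectory places[s..e] (inclusive) matches the query."""
    if s > e:
        return False
    covered: Set[str] = set()
    for _, keywords in places[s:e + 1]:
        covered |= keywords
    return q_words <= covered


def is_minimum_match(q_words: Set[str], places: List[Place], s: int, e: int) -> bool:
    """A match is minimum if dropping either end place breaks the match."""
    return (covers(q_words, places, s, e)
            and not covers(q_words, places, s + 1, e)
            and not covers(q_words, places, s, e - 1))


def match_dist(q_loc: Point, q_words: Set[str], places: List[Place], s: int, e: int) -> float:
    """matchDist of Definition 5; infinity if the sub-trajectory does not match."""
    if not covers(q_words, places, s, e):
        return math.inf
    to_nearest_end = min(dist(q_loc, places[s][0]), dist(q_loc, places[e][0]))
    length = sum(dist(places[i][0], places[i + 1][0]) for i in range(s, e))
    return to_nearest_end + length


def min_match_dist(q_loc: Point, q_words: Set[str],
                   places: List[Place]) -> Tuple[float, Optional[int], Optional[int]]:
    """minMatchDist of Definition 6 by brute force over all minimum matches."""
    best: Tuple[float, Optional[int], Optional[int]] = (math.inf, None, None)
    for s in range(len(places)):
        for e in range(s, len(places)):
            if is_minimum_match(q_words, places, s, e):
                d = match_dist(q_loc, q_words, places, s, e)
                if d < best[0]:
                    best = (d, s, e)
    return best


# Hypothetical four-place trajectory (index 0 corresponds to L1).
trajectory: List[Place] = [
    ((0.0, 0.0), {"gate"}),
    ((1.0, 0.0), {"waterfall", "kiosk"}),
    ((2.0, 0.0), {"meadow", "waterfall"}),
    ((2.0, 1.0), {"kiosk"}),
]
query_words = {"waterfall", "meadow", "kiosk"}
result = min_match_dist((3.0, 1.0), query_words, trajectory)
```

With this sample data the minimum matches are the sub-trajectories at indexes (1, 2) and (2, 3), and the second one yields the smaller match distance because its end place is closest to the query location.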

Definition 7: Top-k Spatial Keyword query (TkSK)

Given a trajectory set $\mathcal{D}$, a top-k spatial keyword query
(TkSK) with $q = \langle \lambda, \psi \rangle$ returns from
$\mathcal{D}$ the k trajectories that have the smallest minimum match
distances with respect to $q$, each associated with the start and end
place indexes that yield the minimum match distance.

Formally, a TkSK query returns a set $Ans(\mathcal{D}, q)$ of k triples
$(t, s, e)$, where $t \in \mathcal{D}$ and $1 \le s \le e \le |t|$, such
that

1. $|Ans(\mathcal{D}, q)| = |\pi_1(Ans(\mathcal{D}, q))| = k$, where
   $\pi_1(.)$ denotes the projection on the first attribute of a set of
   triples of the format $(t, s, e)$.

2. $\forall (t, s, e) \in Ans(\mathcal{D}, q)$,
   $\mathrm{matchDist}(q, t.L_s^e) = \mathrm{minMatchDist}(q, t)$;

3. $\forall (t, s, e) \in Ans(\mathcal{D}, q)$,
   $\forall t' \in \mathcal{D} \setminus \pi_1(Ans(\mathcal{D}, q))$, the
   following inequality holds:
   $\mathrm{minMatchDist}(q, t) \le \mathrm{minMatchDist}(q, t')$.

Intuitively, the answer to the query consists of k sub-trajectories
from k distinct trajectories whose minimum match distances to query
$q$ are the smallest. □

3. BASELINE ALGORITHMS 

No baseline algorithm exists for top-k spatial keyword queries on
trajectory data. We develop four baseline algorithms. The four baseline
algorithms constitute a contribution to the problem of processing top-k
spatial keyword queries in that they explore the possibility of using
existing index techniques for the new problem. The four baseline
algorithms employ the algorithm Match (presented in Section 4.3) for
computing the match distance. Baseline 4 is lengthy and is described in
the Appendix. The baseline algorithms act as background for better
understanding of the problem and its complexity.

3.1 Baseline 1: IF 

The first baseline, IF, uses an inverted file as the index structure.
Specifically, it aggregates the text descriptions associated with the
places in a trajectory to get the set of words of the trajectory, and
then builds an inverted file for all the trajectories.

The idea of the IF algorithm is to use the inverted file to filter
out the trajectories that do not contain all the keywords of query $q$,
i.e., to find the set of trajectories $T_m$ that match the query. Then,
for each trajectory in $T_m$, we compute its match distance to the query
using the algorithm presented in Section 4.3 and find the top-k
trajectories.
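The filtering step of IF can be sketched as follows. This is a minimal in-memory illustration, assuming trajectories are given as a dictionary from a trajectory ID to a list of (location, keywords) places; the function names and the rarest-list-first intersection order are choices made here, not details stated for the baseline.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Place = Tuple[Tuple[float, float], Set[str]]


def build_inverted_file(trajectories: Dict[str, List[Place]]) -> Dict[str, Set[str]]:
    """Map each word to the IDs of the trajectories whose places mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for tid, places in trajectories.items():
        for _, keywords in places:
            for word in keywords:
                index[word].add(tid)
    return index


def matching_trajectories(index: Dict[str, Set[str]], q_words: Set[str]) -> Set[str]:
    """Return T_m: the trajectories that contain every query keyword."""
    if not q_words:
        return set()
    # Intersect the shortest posting lists first so the result shrinks quickly.
    postings = sorted((index.get(w, set()) for w in q_words), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break
    return result


# Hypothetical toy data.
toy = {
    "t1": [((0.0, 0.0), {"a"}), ((1.0, 0.0), {"b"})],
    "t2": [((5.0, 5.0), {"a", "c"})],
    "t3": [((2.0, 2.0), {"b"}), ((3.0, 2.0), {"a", "c"})],
}
toy_index = build_inverted_file(toy)
```

Every trajectory in the returned set still has to be scored with the match distance, which is why IF cannot use spatial proximity to skip distant trajectories.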



3.2 Baseline 2: RT 

The second baseline, RT, uses an R-tree [14] as the index structure.
Specifically, it aggregates the MBR associated with each place in a
trajectory to get the MBR of the trajectory, and then uses an R-tree to
index all the trajectories. For each trajectory, this baseline uses a
separate index structure to organize the text descriptions associated
with the places of the trajectory, as in Component 2 of Section 4.1.

Given a query $q$, the baseline uses the R-tree to find the nearest
trajectories incrementally. For each retrieved trajectory, we check
whether it matches the query keywords. If it does, we compute its match
distance to the query using the algorithm in Section 4.3. In the process,
the algorithm keeps track of the minimum match distance of the current
k-th trajectory, denoted by threshold. For a newly "seen" trajectory with
spatial distance dist to query $q$, if dist exceeds threshold, the
algorithm stops, since it is guaranteed that no "unseen" trajectory has
a smaller match distance than the current k-th trajectory (and thus none
can be in the result). Note that dist is a lower bound of the minimum
match distance.
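The stopping rule of RT can be illustrated with the sketch below. It assumes the incremental nearest-neighbor search of the R-tree is replaced by an iterable that already yields (dist, trajectory ID) pairs in ascending order of dist, and that a callback returns the minimum match distance (infinity for a trajectory that does not match). Both the function name `rt_topk` and these inputs are hypothetical.

```python
import heapq
import math
from typing import Callable, Iterable, List, Tuple


def rt_topk(nearest: Iterable[Tuple[float, str]], k: int,
            min_match_dist: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Top-k by minimum match distance with lower-bound based termination."""
    top: List[Tuple[float, str]] = []  # max-heap via negated distances
    for lower_bound, tid in nearest:
        threshold = -top[0][0] if len(top) == k else math.inf
        if lower_bound > threshold:
            break  # every unseen trajectory is at least this far away
        d = min_match_dist(tid)
        if d < threshold:
            if len(top) == k:
                heapq.heapreplace(top, (-d, tid))
            else:
                heapq.heappush(top, (-d, tid))
    return sorted((-neg, tid) for neg, tid in top)
```

The loop never evaluates a trajectory whose spatial distance already exceeds the current k-th match distance, but it still has to touch every nearby trajectory regardless of its text.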

3.3 Baseline 3: IRT 

The third baseline, IRT, uses the IR-tree [10] as the index structure,
which is used to index spatial Web objects. The IR-tree is essentially an
R-tree [14] extended with inverted files [33]. Each leaf node in the
IR-tree contains a number of entries of the form $(p, p.\lambda)$, where
$p$ refers to the identifier of a spatial object, and $p.\lambda$ is the
bounding rectangle of $p$. Each leaf node also contains a pointer to an
inverted file for the text descriptions of the objects stored in the
node. Each non-leaf node R in the IR-tree contains a number of entries of
the form (cp, rect, cp.di), where cp is the address of a child node of R,
rect is the MBR of all rectangles in entries of the child node, and cp.di
is the identifier of a pseudo text description of the child node. The
pseudo text description is a union of all text descriptions in the
entries of the child node. Each non-leaf node also contains a pointer to
an inverted file for the pseudo text descriptions of its child nodes. The
pseudo text description enables us to prune a node (and the subtree under
the node) if it does not cover all the query keywords.

To use the IR-tree [10] to organize the trajectories, we aggregate the
MBR associated with each place in a trajectory to get the MBR of the
trajectory; similarly, we get the set of words of the trajectory by
aggregating the text description of each place.

We adapt the top-k algorithm presented in [10], which is based on
best-first search, to find the top-k trajectories. A priority queue U is
used to keep track of the nodes and trajectories that have yet to be
visited. The values of minDist(q, .), which is the minimum Euclidean
distance between $q$ and a trajectory (or a node), are used as the keys.
Note that the key used for a trajectory in U is not the match distance,
but a loose lower bound of the match distance between the query and the
trajectories in a node. It is used to choose which node to visit next and
when to terminate the algorithm. When deciding which node to visit next,
the algorithm picks the node CN with the smallest minDist(q, CN) value in
the set of all nodes that have yet to be visited. The algorithm
terminates when the match distance of the k-th trajectory is smaller than
the key of the first element in U. Algorithm 1 shows the pseudo-code.

3.4 Discussion 

The first baseline IF uses the text information to prune the search 
space without utilizing the spatial information to speed up. The 
second baseline RT uses the spatial information to guide the search 
for results without utilizing the text information. 

Algorithm 1: IRT(query q, tree root root, integer k)

    V ← new max-priority queue of k elements of ∞;
    U ← new min-priority queue;
    U.Enqueue(root, 0);
    while U is not empty do
        e ← U.Dequeue();
        if minDist(q.λ, e.λ) > V[k] then
            break while-loop;
        if e is a trajectory then
            update V by (e, Match(q, e, V[k]));
        else  // e points to a child node
            read the node CN of e;
            read the posting lists of CN for keywords in q.ψ;
            for each entry e' in the node CN do
                if q.ψ ⊆ e'.ψ and minDist(q.λ, e'.λ) < V[k] then
                    U.Enqueue(e', minDist(q.λ, e'.λ));
    return V;  // top-k results

Different from the first two baselines, the baseline IRT (and the
baseline given in the Appendix) is able to make use of both text
information and distance information to prune the search space. However,
IRT faces the following challenges. The MBR of a trajectory can be much
larger than the real geographical space occupied by the places in the
trajectory, and thus the MBRs of nodes in the IR-tree overlap heavily.
The text description of a trajectory is the aggregation of the
descriptions of all places in the trajectory. Hence, the overlap of text
descriptions between nodes with heavily overlapping MBRs is also large.
Thus the pruning power of the text information associated with the
IR-tree nodes might be limited.

4. PROPOSED ALGORITHMS FOR QUERY 
PROCESSING 

Section 4.1 presents the proposed index, the B^ck-tree. Based on the
index, we present the incremental expansion algorithm for finding
candidate trajectories in Section 4.2. Section 4.3 presents an algorithm
for matching a candidate trajectory with a query.

4.1 Proposed Index: B^ck-tree

Ideally, we can index trajectories associated with text information to
enable pruning of the search space by utilizing both spatial distance and
keyword information for efficient query processing. It is, however,
challenging to develop indexes that meet the complexities of trajectories
associated with text information. To this end, we propose an index,
called the cell-keyword conscious B+-tree, denoted by B^ck-tree, which
comprises two components.

1) Component 1 is used to locate the IDs of trajectories that are 
close to the query location and contain all the keywords. It is used 
to organize the segment-level information of trajectories. 

2) Component 2 is used to compute the minimum match distance 
of a selected trajectory to query q. It is used to organize the detailed 
information of each trajectory. 

Component 1: We divide the spatial region of dataset $\mathcal{D}$ into
quad cells of various sizes to generate location codes. The ID of
each cell can be generated by using the bit-interleaving method (T).
If a quad cell consists of a set of uniform cells, the minimum ID of
the set of cells will be the ID of the quad cell. Figure 2(a) shows an
example. Based on the cell division, we build a B+-tree to index
trajectories together with their text descriptions. Each leaf entry
contains three elements:

• wordID: it denotes the ID of a word in the trajectory database.

• cellID: it denotes the ID of a cell that contains a wordID.

• posting list: it is a sequence of trajectory identifiers for each
  wordID and cellID, i.e., the list of trajectories in cell cellID
  that contain word wordID.

In the index, the entries are organized first by the word ID, and
next by the cell ID. Hence, the posting lists for the same word are
organized together, and the posting lists of nearby cells for the same
word are stored together. This enables visiting nearby cells for a word
by following the pointers between leaf nodes of the B+-tree.

All distinct words in the text descriptions of the trajectory database
constitute a vocabulary, and each word has a wordID. We proceed
to explain the other two elements, cellID and posting list.

cellID: The cellID element aims to integrate the spatial information and
text information of trajectories. We partition the index space into
cells, and thus one trajectory may span multiple cells. The sizes of
cells are not fixed. We set the size of a cell such that the number of
trajectories in a cell is smaller than a threshold. Note that empty cells
are not indexed.

posting list: Given a set of query words, we need to check if a cell
contains a trajectory that covers all the query keywords. To meet this
need, for each wordID w and cellID c, a posting list is a sequence of
identifiers of the trajectories such that part of the trajectory or the
whole trajectory falls in cell c, and the text description associated
with the trajectory segment in c contains word w. Such a design enables
associating the cell ID, which represents the spatial information of a
trajectory segment, with the text information of the segment.

Example 3: In Figures 2(a) and 2(b), for cell 52, we generate one entry
(a, 52, <t1, t3>) since the fragments of trajectories t1 and t3 in cell
52 contain word a. Similarly, we generate another entry (b, 52,
<t1, t3>) for cell 52. As another example, for cell 28, two example
entries (out of 7 in total) are (g, 28, <t1>) and (b, 28, <t1>). Here
we do not include the detailed information on which places contain
a specific word for a trajectory. It is also noteworthy that empty
cells are not indexed. □
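The layout of Component 1 can be imitated in memory with a sorted list of (wordID, cellID, posting list) entries. The sketch below makes several assumptions: the bit order of the interleaving (column bit below row bit) is one common Z-order convention and may differ from the one used for Figure 2, a sorted Python list stands in for the B+-tree leaf level, and the fragment word sets only reproduce the four entries quoted in Example 3.

```python
import bisect
from typing import Dict, Iterable, List, Set, Tuple

Entry = Tuple[str, int, Tuple[str, ...]]  # (wordID, cellID, posting list)


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-order cell ID obtained by interleaving the bits of column x and row y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def build_leaf_entries(fragments: Iterable[Tuple[str, int, Set[str]]]) -> List[Entry]:
    """Group (trajectory, cell, words) fragments into entries sorted by (word, cell)."""
    postings: Dict[Tuple[str, int], List[str]] = {}
    for tid, cell, words in fragments:
        for word in words:
            plist = postings.setdefault((word, cell), [])
            if tid not in plist:
                plist.append(tid)
    return sorted((w, c, tuple(ts)) for (w, c), ts in postings.items())


def scan(entries: List[Entry], word: str, lo_cell: int, hi_cell: int) -> List[Entry]:
    """Range scan: entries of one word whose cell ID lies in [lo_cell, hi_cell]."""
    keys = [(w, c) for w, c, _ in entries]
    start = bisect.bisect_left(keys, (word, lo_cell))
    end = bisect.bisect_right(keys, (word, hi_cell))
    return entries[start:end]


# Fragments chosen so that the entries quoted in Example 3 appear.
entries = build_leaf_entries([
    ("t1", 52, {"a", "b"}),
    ("t3", 52, {"a", "b"}),
    ("t1", 28, {"g", "b"}),
])
```

Because the sort key is (word, cell), all cells of one word are contiguous, so a scan over a cell-ID interval for a given word touches only neighboring entries.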

However, the above design is problematic at query time when a
trajectory spans multiple cells and no individual fragment contains all
the query words, but several fragments together match all the query
words (this will become clear in the next section). A simple fix is to
associate each cell with all the words of a trajectory. However, this
significantly increases the space cost.
We next present a carefully designed mechanism that needs less
space while returning the correct results. Suppose that a trajectory
$TR$ falls in $m$ cells. We denote a trajectory fragment as $TR_i$,
$i \in [1, m]$, where $TR_i$ is adjacent to $TR_{i+1}$ in the trajectory.
We associate words with each trajectory fragment as follows.

• If i is odd, the set of words for fragment $TR_i$ is the union of
  words in the places in the fragment.

• If i is even, the set of words for fragment $TR_i$ is
  $\bigcup_{j=i-1}^{\min\{i+1, m\}} TR_j.\psi$, where $TR_j.\psi$ is the
  union of words in the places in fragment $TR_j$.

For example, consider trajectory t2 in Figure 3 with three segments in
three cells. Each of the three segments contains a term. According to
the proposed mechanism, we associate segment 1 with a, segment 2 with
a, b, c, and segment 3 with c.

This method guarantees the correctness of the proposed algorithms, and
we prove this in Lemma 2. Note that one cell can contain multiple
fragments of the same trajectory, and the aforementioned method is
equally applicable.
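The odd/even association rule can be written down directly. The sketch below assumes the fragments of one trajectory are given in order as a list of word sets (fragment i is at list position i - 1); the function name is hypothetical.

```python
from typing import List, Set


def associate_words(fragment_words: List[Set[str]]) -> List[Set[str]]:
    """Words stored in the index for each fragment of one trajectory.

    Odd fragments keep their own words; an even fragment i additionally
    absorbs the words of its neighbors i - 1 and min(i + 1, m).
    """
    m = len(fragment_words)
    associated: List[Set[str]] = []
    for i in range(1, m + 1):  # 1-based fragment index, as in the text
        if i % 2 == 1:
            associated.append(set(fragment_words[i - 1]))
        else:
            merged: Set[str] = set()
            for j in range(i - 1, min(i + 1, m) + 1):
                merged |= fragment_words[j - 1]
            associated.append(merged)
    return associated


# Trajectory t2 of the running example: one term per segment.
t2_words = associate_words([{"a"}, {"b"}, {"c"}])
```

For the three segments of t2 this reproduces the association stated above: segment 1 keeps a, segment 2 stores a, b, c, and segment 3 keeps c. Only every second fragment stores extra words, which is where the space saving over copying all words into every cell comes from.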

Component 2: We use a B+-tree to organize the place information and the
associated keywords in all the trajectories. The text description for a
place can be either short or long. We use an inverted list for each
trajectory. Each entry consists of three elements: trajectory ID, word
ID, and the list of place IDs in the trajectory, where trajectory ID and
word ID compose the key of the B+-tree. Note that the inverted file is
the most efficient index for text information retrieval [33].

We discuss the updating process of the B^ck-tree in the Appendix.

Algorithm 2: IE(query q, result size k)

     1  V ← new min-priority queue of k elements of ∞;  // maintain top-k trajectories
     2  i ← 0;
     3  while true do
     4      rq_i ← compute a range radius;
     5      R_i ← construct a range with q as the center and rq_i as extension;
     6      if i ≠ 0 then
     7          R_i ← R_i − R_{i−1};
     8      A ← CTR(q, R_i);  // See Procedure CTR
     9      for each trajectory t in A do
    10          read posting lists of t for keywords in q;
    11          dist ← Match(q, t, V[k]);  // See Section 4.3
    12          if V[k] > dist then V.add(dist, t);
    13      if V[k] < rq_i then break;
    14      i ← i + 1;
    15  return top-k trajectories in V;  // top-k results
The proposed index solution, the B^ck-tree, can be implemented using
DBMSs that support the B+-tree, and is update friendly. It enables
designing algorithms for processing TkSK queries that are able to prune
the search space using both types of information. In addition to TkSK
queries, it also supports other types of queries containing a keyword
component and a spatial component, e.g., finding trajectories containing
a set of keywords within a region. Different from the IRT baseline, which
takes each trajectory as a whole and associates text information with the
trajectory, in the proposed index we use cells to divide trajectories
into segments and design an effective mechanism to associate keywords
with segments. For example, in Figure 3, given a top-1 query q with
keyword b, t1 is the answer. If we use the proposed word association
mechanism, we can prune t2 since the segment in the first cell is
associated with word a only. However, in the IRT, we cannot prune t2
since it takes the trajectory as a whole. As another example, if q is to
find trajectories whose segment contains b and falls in the circle, we
can prune t2, while t2 cannot be pruned if treated as a whole.

Figure 3: Associating words with trajectory segments

4.2 Incremental Expansion Algorithm (IE) 

We compute top-k trajectories by iteratively performing range queries
with an incrementally expanded search region on the B^ck-tree until the
top-k matching trajectories are retrieved. The Incremental Expansion
algorithm (IE) is outlined in Algorithm 2. The algorithm IE first
initializes a priority queue to maintain the top-k results. We construct
a range query with q as the center and a query-dependent $rq_0$ as the
extension (lines 4-5).

To compute extension $rq_0$, we take into account both keyword
information and spatial information. Let $p(q.\psi)$ be the probability
that a trajectory of $\mathcal{D}$ contains $q.\psi$ in its keyword set.
We estimate the probability by $p(q.\psi) = \prod_{w \in q.\psi} p(w)$,
where $p(w)$ can be estimated using maximum likelihood estimation, i.e.,
as the probability that a trajectory in dataset $\mathcal{D}$ contains
the word. We compute $rq_0$ by

$$rq_0 = \sqrt{\frac{k \times L}{\pi \times |\mathcal{D}| \times p(q.\psi)}},$$

where $L$ is the area size of the whole region, since a region of area
size $\pi \times rq_0^2$ would probabilistically contain segments of k
trajectories that contain all query keywords if the trajectories are
uniformly distributed in the whole region.
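The estimate can be computed as in the following sketch, which assumes that a dictionary with the number of trajectories containing each word is available. The function name, the parameter names, and the choice to return infinity when some query word never occurs are assumptions made here.

```python
import math
from typing import Dict, Iterable


def initial_radius(k: int, region_area: float, num_trajectories: int,
                   word_traj_count: Dict[str, int], q_words: Iterable[str]) -> float:
    """Initial extension rq_0 = sqrt(k * L / (pi * |D| * p(q.psi)))."""
    p = 1.0
    for word in q_words:
        # Maximum likelihood estimate: fraction of trajectories containing the word.
        p *= word_traj_count.get(word, 0) / num_trajectories
    if p == 0.0:
        return math.inf  # assumption: no finite radius is expected to yield k matches
    return math.sqrt(k * region_area / (math.pi * num_trajectories * p))


# Hypothetical numbers: 100 trajectories in a region of area 100 * pi.
r0 = initial_radius(1, 100 * math.pi, 100, {"a": 50, "b": 20}, ["a", "b"])
```

With these numbers p(q.ψ) is 0.5 × 0.2 = 0.1, so one matching trajectory is expected in a disk of area 10π, which gives a radius of √10. Rarer keyword sets produce a larger initial radius, which avoids many small, empty first iterations.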

For the trajectories in the range, we check whether they contain all the
query keywords by invoking function CTR (Candidate Trajectory Retrieval,
to be presented shortly). For each returned candidate trajectory t in A,
the algorithm invokes algorithm Match(q, t, V[k]) (see Section 4.3) to
compute the minimum match distance between q and t. If the match distance
of the k-th result is smaller than $rq_i$, it is safe to terminate the
algorithm because the algorithm has considered all the trajectories that
can possibly be in the top-k results. Otherwise, we compute a new range
$rq_i = rq_{i-1} + x$, where x is the side length of the smallest quad
cell in the index. We also tried other options, e.g., the average side
length. However, the performance of such options is worse in general. We
then retrieve trajectories in the region formed by radius $rq_i$, but not
included in the region formed by $rq_{i-1}$.
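The control flow of IE (expand the radius, score new candidates, stop once the k-th match distance is below the radius) can be sketched as follows. This is a simplified illustration under several assumptions: `candidates_in_ring` is a hypothetical stand-in for CTR applied to the ring between two radii, `min_match_dist` stands in for Match, and `max_radius` is a guard added here so that the loop ends even when fewer than k trajectories match.

```python
import heapq
import math
from typing import Callable, Iterable, List, Set, Tuple


def incremental_expansion(k: int, rq0: float, step: float, max_radius: float,
                          candidates_in_ring: Callable[[float, float], Iterable[str]],
                          min_match_dist: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Grow the search radius until the k-th best match distance is below it."""
    top: List[Tuple[float, str]] = []  # max-heap via negated distances
    seen: Set[str] = set()
    prev, radius = 0.0, rq0
    while True:
        for tid in candidates_in_ring(prev, radius):
            if tid in seen:
                continue  # a trajectory may intersect several rings
            seen.add(tid)
            d = min_match_dist(tid)
            kth = -top[0][0] if len(top) == k else math.inf
            if d < kth:
                if len(top) == k:
                    heapq.heapreplace(top, (-d, tid))
                else:
                    heapq.heappush(top, (-d, tid))
        kth = -top[0][0] if len(top) == k else math.inf
        if kth < radius or radius >= max_radius:
            break  # nothing outside the current radius can beat the k-th result
        prev, radius = radius, radius + step
    return sorted((-neg, tid) for neg, tid in top)
```

The termination test relies on the fact that any trajectory not yet retrieved lies entirely outside the current radius, so its match distance is at least that radius.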

We proceed to present procedure CTR, which retrieves the trajectories in
the given range R that contain all the query keywords. CTR processes
query keywords one by one. For the first query keyword, we find the
trajectories that contain the query keyword and intersect the given query
range R. Recall that the B^ck-tree organizes the lists of trajectories by
word ID and then by cell ID. This enables us to retrieve those
trajectories that contain the query keywords and fall in certain cells.
For each subsequent query keyword, we filter out trajectories that do not
contain the query word by scanning the corresponding cells.

The candidate trajectory retrieval (CTR) algorithm is outlined in 
Procedure ICTRI It takes two arguments: query q and the given re- 
gion R. It first computes the intervals of cell IDs that are covered 
by the query region R (line 2). The algorithm proceeds to process 
Procedure CTR(q, R)

 1  A ← new array; // maintain the trajectories to be checked
 2  I ← Compute the intervals {(sid_j, eid_j)} of R; // start cell id
 3  for each keyword w_i (i = 1, ..., |q.ψ|) do
 4      for each interval I_j = (sid_j, eid_j) in I do
 5          (C_j, I'_j) ← getCellInInterval(w_i, I_j);
 6          if I'_j = ∅ then
 7              remove I_j from I;
 8              continue;
 9          else
10              update I_j with I'_j;
11          if i = 1 then
12              for each cell c in C_j do
13                  add trajectories in c to A;
14          else
15              remove trajectories in A that are not covered by any cell c in C_j;
16  return A;



each query word $w_i$ (lines 3-15). For each interval, it returns the
trajectories that contain keyword $w_i$ and fall in the interval. Function
getCellInInterval(.) returns in $C_j$ those cells that contain word
$w_i$ and fall in the interval $I_j$. The function is implemented by
following pointers between leaf nodes of the B+-tree, and the jump
technique [23] is used to optimize the implementation by jumping over
pages. If interval $I_j = (sid_j, eid_j)$ does not contain any cell
containing word $w_i$, we remove the interval from consideration (lines 6-8).
Function getCellInInterval(.) also returns a smaller interval $I'_j$ if the
interval covered by cells in $C_j$ is smaller than $I_j$. We use $I'_j$ to
update the interval boundaries $sid_j$ and $eid_j$ (line 10). For the first
keyword $w_1$, we add the trajectories that contain word $w_1$ and are in
the region R to the set of candidate trajectories A (lines 11-12). For
each subsequent keyword, the algorithm filters out the trajectories in
A that do not contain the keyword (line 15). In the implementation,
we organize trajectories in A by cell ID and filter trajectories in a
cell if the cell does not contain a query word.

We process query words in the ascending order of their frequen- 
cies, i.e., infrequent words are processed first. The reason is that 
infrequent keywords are more likely to prune trajectories. 
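
The filtering performed by CTR can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BerkeleyDB-based implementation: the index is mocked by two hypothetical in-memory structures (`cell_lists`, mapping a word to its sorted cell IDs, and `postings`, mapping a (word, cell) pair to trajectory IDs), and candidates are intersected by trajectory ID rather than filtered cell by cell.

```python
from bisect import bisect_left, bisect_right

def ctr(query_words, intervals, cell_lists, postings, freq):
    """Return trajectory ids that contain every query word inside the intervals.

    cell_lists, postings and freq are hypothetical stand-ins for the index.
    """
    # Process infrequent words first, since they prune more candidates.
    words = sorted(query_words, key=lambda w: freq.get(w, 0))
    live = list(intervals)      # intervals of cell ids still under consideration
    candidates = None           # None means "no keyword processed yet"
    for w in words:
        cells = cell_lists.get(w, [])
        found = set()
        next_live = []
        for sid, eid in live:
            lo = bisect_left(cells, sid)
            hi = bisect_right(cells, eid)
            if lo == hi:
                continue        # no cell with word w here: drop the interval
            # Shrink the interval to the cells that actually contain w.
            next_live.append((cells[lo], cells[hi - 1]))
            for c in cells[lo:hi]:
                found |= postings[(w, c)]
        live = next_live
        candidates = found if candidates is None else candidates & found
        if not candidates:
            return set()
    return candidates if candidates is not None else set()
```

Sorting by `freq` mirrors the processing order described above; the early return avoids scanning the remaining keywords once no candidate survives.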

Before we prove the correctness of the proposed algorithm, we 
first present a lemma. 

Lemma 1: Consider query q and a trajectory TR that falls in m
cells. We denote a trajectory fragment as $TR_i$, $i \in [1, m]$, and the
cell containing trajectory fragment $TR_i$ as cell($TR_i$). If the minimum
match of TR for q that results in the minimum match distance
falls in cells $[c_1, c_2]$, $1 \le c_1 \le c_2 \le m$, we have
$minMatchDist(q, TR) \ge \max_{c_j \in [c_1, c_2]} minDist(q, c_j)$.

Proof Sketch: Based on the triangle inequality, we know that
$minMatchDist(q, TR) \ge Dist(q, TR.L)$, where TR.L is a place in
the sub-trajectory of TR that is a minimum match of query q. It
is easy to see that $minDist(q, c_j)$ is not larger than $Dist(q, TR.L)$
where L is in cell $c_j$. We complete the proof. □
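
The bound used in the proof rests on the minimum distance from a point to a rectangular cell. A minimal sketch follows, assuming axis-aligned cells given as hypothetical (x1, y1, x2, y2) tuples and Euclidean distance; the function name is illustrative only.

```python
import math

def min_dist(q, cell):
    """Smallest Euclidean distance from point q to an axis-aligned cell."""
    x, y = q
    x1, y1, x2, y2 = cell
    # Per-axis gap between the point and the cell (zero if inside the extent).
    dx = max(x1 - x, 0.0, x - x2)
    dy = max(y1 - y, 0.0, y - y2)
    return math.hypot(dx, dy)

# Any place inside the cell is at least min_dist away from q.
q = (0.0, 0.0)
cell = (3.0, 4.0, 5.0, 6.0)
place = (4.0, 5.5)
assert min_dist(q, cell) <= math.hypot(place[0] - q[0], place[1] - q[1])
```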

According to Lemma 1, the minimum match distance of a trajectory
to the query is no smaller than the distance from the query to any cell
that contains parts of the matching trajectory. We are now ready to
present the correctness of IE.

Lemma 2: The Incremental Expansion algorithm guarantees to
find top-k trajectories using a B^ck-tree that employs the method (in
Section 4.1) of associating the words of the trajectory with the
different segments.

Proof Sketch: Suppose that a trajectory TR falls in m cells. We
denote a trajectory fragment as $TR_i$, $i \in [1, m]$. Two cases cover
all the possibilities that TR is one of the top-k results.

Case 1: If $TR_i$ contains all the query words and the cell containing
$TR_i$ is in the range of the match distance of the kth trajectory in the
current result set (i.e., $mindist(cell(TR_i), q) \le minmatchdist(TR, q)$),
the trajectory TR will be retrieved and we compute its match distance.
In this case, algorithm IE will not miss the trajectory TR
if it is a result. Note that if the cell containing $TR_i$ is not in the
range, $TR_i$ cannot be the matching part of a top-k trajectory.

Case 2: We next consider the case that none of the sub-trajectories
contains all the query keywords, but the trajectory contains all the
query keywords. We first consider that two sub-trajectories in two
adjacent cells cover the query keywords. The method of associating
keywords with cells makes sure that one of the two sub-trajectories in
the two cells will be associated with at least the keywords of both
sub-trajectories. According to Lemma 1, the distance from the query
to the cell containing a sub-trajectory associated with all keywords
of the two adjacent cells must be no larger than the match distance
between the trajectory and the query. This guarantees that algorithm IE
will not miss the trajectory TR if it is a result. Similarly, when
more than two adjacent cells together cover the query keywords, at
least one of the sub-trajectories of TR in these cells contains all the
keywords of these sub-trajectories according to the method of
associating keywords with sub-trajectories. Thus, algorithm IE will not
miss the trajectory TR if it is a result.

The two cases cover all the possibilities that a trajectory can be
a top-k result. Therefore, Algorithm IE is correct and complete. □

4.2.1 Cost Analysis of IE Algorithm

First of all, it is noteworthy that our incremental expansion
algorithm (IE in Algorithm 2) has an asymptotically equivalent effect
of a window search through the specific B+-tree index that is also
known as a linear quadtree. In particular, such an equivalent query
is centered at query location q.λ and its window size is bounded
by the place $L_{last}$ that our algorithm fetches as the last place on the
trajectory that contributes to the kth minimum match distance.

To make the analysis clear, we assume that k is 1, i.e., we only
get the top-1 trajectory with minimum match distance. Let the
distance from q.λ to $L_{last}$ through the trajectory, i.e., the
corresponding minimum match distance, be $Dist(q.\lambda, L_{last})$. This matching
distance can be used as the half window size in the aforementioned
equivalent window query.

The matching distance $Dist(q.\lambda, L_{last})$ is not a Euclidean
distance since we work with trajectories. Nevertheless, we use a
Euclidean distance value equal to $Dist(q.\lambda, L_{last})$ as the half window
size in the equivalent window query. This is justified by the fact that
any place outside the window thus determined must result in a larger
matching distance than does $L_{last}$. In this sense, our algorithm
does not need to visit any farther places outside the window. On
the other hand, we cannot reduce the window size to a value less
than $Dist(q.\lambda, L_{last})$ because this distance itself can be a Euclidean
distance if all involved places and q.λ are on the same straight line.

Aboulnaga and Aref [2] proposed a cost model for window query
processing in linear quadtrees. Given a query window W and a
quadtree T, the model estimates the query cost by recursively counting
the quads that overlap or are enclosed by W. This model can
be employed here to estimate the I/O cost our algorithm incurs in
searching for trajectories with minimum match distance.
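
The recursive counting idea can be illustrated with a short sketch. The code below is not the cost model of [2]; it merely assumes a complete quadtree of a given depth over a square space and counts the quads whose interior intersects a window. The function name and the (x1, y1, x2, y2) tuple layout are hypothetical.

```python
def count_overlapping_quads(cell, window, depth):
    """Count quadtree nodes (down to `depth` levels) that overlap `window`."""
    x1, y1, x2, y2 = cell
    wx1, wy1, wx2, wy2 = window
    # No shared interior: this quad and all its descendants are skipped.
    if x2 <= wx1 or wx2 <= x1 or y2 <= wy1 or wy2 <= y1:
        return 0
    if depth == 0:
        return 1
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    children = [
        (x1, y1, mx, my), (mx, y1, x2, my),
        (x1, my, mx, y2), (mx, my, x2, y2),
    ]
    # Count this quad plus every overlapping descendant.
    return 1 + sum(count_overlapping_quads(c, window, depth - 1) for c in children)
```

A larger window or a deeper tree increases the count, which is the qualitative behavior the half window size $Dist(q.\lambda, L_{last})$ feeds into.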

Next, we elaborate on how to estimate the minimum match
distance $Dist(q.\lambda, L_{last})$ since it determines the window query size.
Let K be the total number of keywords in the entire space of interest,
C be the maximum number of places per trajectory, and Q be
the number of keywords in q, i.e., $Q = |q.\psi|$. Our analysis needs
the information about the distribution of trajectory places in the space,
as well as the distribution of keywords over all trajectory places. Both
distributions can be very complicated due to many hard-to-describe
factors including environment and humans. We hereby make two
simplifying assumptions. We assume there are w keywords on average
per trajectory place, and no keyword is repeated across places
within the same trajectory.

We count how many places our algorithm visits on the returned
trajectory, i.e., the one with the minimum match distance
$Dist(q.\lambda, L_{last})$. The counting starts at the first place where at least one
required keyword in q.ψ is included, denoted as $L_s$, and ends at place
$L_{last}$. Note that both $L_s$ and $L_{last}$ must have at least one required
keyword in q.ψ. We use Pr(i) to denote the probability that $L_{last}$ is
the i-th place inclusively from $L_s$.

The probability of a single place containing the query words q.ψ
is computed by

$$
\begin{aligned}
Pr(1) = pr(q \in L.\psi) &= \prod_{w \in q.\psi} pr(w \in L.\psi) \\
&= \prod_{w \in q.\psi} \left(1 - pr(w \neq L.\psi_i)^{|L.\psi|}\right) \\
&= \prod_{w \in q.\psi} \left(1 - (1 - pr(w))^{|L.\psi|}\right), \qquad (1)
\end{aligned}
$$

where q is the query, $L.\psi_i$ ($i \in [1, |L.\psi|]$) is a word in L.ψ, and
pr(w) is the probability that a word in a place L is the query word
w.

We say that i places "jointly" contain the query words if 1) the
i places cover the query words, 2) the first place and the last place
contain some query keywords, and 3) no proper subset
of the i places contains all the query words. We denote the
probability that i places "jointly" contain the query words by Pr(i). To
compute it, we first compute the probability pr(i) that a subset of the i
places contains the query words, which can be computed as
we do in Equation 1, that is,

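$$pr(i) = pr\Big(q \in \bigcup_{j=1}^{i} L_j\Big) = \prod_{w \in q.\psi} \left(1 - (1 - pr(w))^{\left|\bigcup_{j=1}^{i} L_j.\psi\right|}\right)$$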

We next compute the probability that each place in a subset of
the i places contains all the query words.

$$p_1(i) = \sum_{j=1}^{i} \binom{i}{j} Pr(1)^{j} \cdot (1 - Pr(1))^{i-j}$$

where $Pr(1)^{j}$ is the probability that each of the individual j places
contains all the query words, and $(1 - Pr(1))^{i-j}$ is the probability
that each of the other i − j places does not contain all the query
words.

We next compute the probability that a proper subset of the i
(i > 2) places jointly contains the query words such that the subset
does not contain the first and the last places of the i places and no
single place contains all the query words.

$$p_2(i) = \sum_{j=2}^{i-1} \left(\binom{i}{j} - \binom{i-2}{j-2}\right) Pr(j) \cdot (1 - Pr(i-j))$$

where Pr(j) is the probability that j places of the i places jointly
contain the query words, and (1 − Pr(i − j)) is the probability that
the other i − j places do not contain the query words.
Finally, we are ready to compute Pr(i). 

$$Pr(i) = pr(i) - p_1(i) - p_2(i) \qquad (2)$$



As an example, the probability that two places $L_1$ and $L_2$ jointly
contain the query words q.ψ is $Pr(2) = pr(2) - p_1(2) = pr(2) -
2Pr(1) \cdot (1 - Pr(1)) - Pr(1)^2$.
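
The two-place example can be evaluated numerically. The sketch below assumes, for illustration only, that every place carries the same number of words and that the two places together carry twice as many; the function names and the list of per-word probabilities pr(w) are hypothetical inputs.

```python
def pr_cover(word_probs, num_words):
    """Probability that a pool of num_words words covers every query word.

    This is the product form of Equation (1); word_probs holds pr(w) per
    query word.
    """
    p = 1.0
    for pw in word_probs:
        p *= 1.0 - (1.0 - pw) ** num_words
    return p

def pr_two_places_jointly(word_probs, words_per_place):
    """Pr(2) = pr(2) - 2 Pr(1) (1 - Pr(1)) - Pr(1)^2, as in the example."""
    pr1 = pr_cover(word_probs, words_per_place)          # Pr(1)
    pr2 = pr_cover(word_probs, 2 * words_per_place)      # pr(2), pooled words
    return pr2 - 2.0 * pr1 * (1.0 - pr1) - pr1 ** 2

# One query word with pr(w) = 0.5 and one word per place.
print(pr_cover([0.5], 1), pr_two_places_jointly([0.5], 1))
```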

Consequently, the expected number of places to visit is $\sum_i i \cdot
Pr(i)$. Assuming that the average segment length of all trajectories
is len, the expected distance from place $L_s$ to place $L_{last}$ is
$len \cdot \sum_i i \cdot Pr(i)$.

Finally, we estimate the Euclidean distance between query
location q.λ and place $L_s$, i.e., $Dist(q.\lambda, L_s)$. Suppose there are Y
trajectories in the entire space, which results in Y · C places in
total. The average Euclidean distance between two adjacent places is
$L / \sqrt{Y \cdot C}$, where L is the side size of the entire space. On average
we need to visit $\lceil K / (w \cdot Q) \rceil$ places to see a required keyword in q.ψ.
As a result, $Dist(q.\lambda, L_s)$ is approximated by $L / \sqrt{Y \cdot C} \cdot \lceil K / (w \cdot Q) \rceil$.

To put it all together, $Dist(q.\lambda, L_{last}) \approx L / \sqrt{Y \cdot C} \cdot \lceil K / (w \cdot Q) \rceil + len \cdot
\sum_i i \cdot Pr(i)$. As mentioned above, this distance and the query
location q.λ together determine the window query whose cost can be
estimated using the model proposed by Aboulnaga and Aref [2].
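
As a rough numerical aid, the estimate can be written as two small functions. This is a sketch only: the parameter names are hypothetical, and the expected number of visited places is passed in as a precomputed value rather than derived from Pr(i).

```python
import math

def estimate_dist_to_first_place(side, num_traj, places_per_traj,
                                 total_keywords, words_per_place, query_size):
    """Approximate Dist(q.lambda, L_s) = L / sqrt(Y*C) * ceil(K / (w*Q))."""
    spacing = side / math.sqrt(num_traj * places_per_traj)
    places_to_visit = math.ceil(total_keywords / (words_per_place * query_size))
    return spacing * places_to_visit

def estimate_half_window(dist_to_first, avg_segment_len, expected_places):
    """Add the expected travel along the trajectory from L_s to L_last."""
    return dist_to_first + avg_segment_len * expected_places

# Toy numbers, not taken from the experiments.
d = estimate_dist_to_first_place(100.0, 100, 100, 1000, 10, 2)
print(d, estimate_half_window(d, 2.0, 3.0))
```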

4.3 Computing Match Distance of a Trajectory 

We present algorithm Match for searching the minimum match of 
a selected trajectory to a query, and computing the match distance. 
Match is invoked by algorithm IE and our baseline algorithms. 

Given a trajectory $TR = L_1, \ldots, L_n$ and a query q, a naive
approach to finding the minimum match is to check all possible
sub-trajectories in TR. For each sub-trajectory, we check if it is a
match of the query q; if it is, we compute the match distance.
Finally, we get the minimum match distance. The time complexity of
the naive approach is $O(|TR|^2)$.

We proceed to develop an approach with $O(|TR|)$ complexity
based on the principle of divide and conquer and the idea of dynamic
programming. Specifically, we divide the problem into sub-problems,
each of which is to search for the minimum match starting
from a place in a trajectory TR. At each place, we check whether
query q can be matched by a sub-trajectory starting at the place.
Here a key idea is that we reuse the computation of the sub-problem
of finding the minimum match sub-trajectory starting at the preceding
place for processing the sub-problem of finding the matching
sub-trajectory starting at the current place. After we process all the
sub-problems, we will find a minimum match, if any.

We now introduce lemmas required for developing the algorithm.
Based on Definition 3, we have the following proposition.

Proposition 1: If a sub-trajectory $TR.L_s^e$ from place $L_s$ to place $L_e$
matches q, then any sub-trajectory containing $TR.L_s^e$ matches q. If
a sub-trajectory $TR.L_s^e$ is not a match of q, then any sub-trajectory
of $TR.L_s^e$ is not a match. □

Lemma 3: If a sub-trajectory $TR.L_s^e$ is a minimum match of a
query q, and sub-trajectory $TR.L_{ps}^{ed}$ is a match of query q such
that $ps \le s$ and $ed \ge e$, then $matchDist(q, TR.L_s^e) \le matchDist(q,
TR.L_{ps}^{ed})$.

Proof Sketch: We can prove the lemma by the triangle inequality.
The distance between q and $L_s$ must be smaller than the
sum of the distance between q and $L_{ps}$ and the distance between
$L_{ps}$ and $L_s$. □

Based on Lemma 3, we have the following proposition.

Proposition 2: If a sub-trajectory $TR_1$ is contained by sub-trajectory
$TR_2$, the match distance of $TR_1$ to query q is smaller than that of
$TR_2$. □

Lemma 4: Let sub-trajectory $TR.L_s^e$ be a match of query q. The
maximal distance of all places in $TR.L_s^e$ to query q, i.e.,
Procedure Match(query q, trajectory TR, distance ξ)

 1  mDist ← ∞; ts ← ∞; te ← ∞; // Result variables
 2  C ← an array of |q.ψ| elements of 0; // used as the counter for each query word
 3  for each word w in L_1.ψ do C[w] ← C[w] + 1;
 4  ll ← 1; // the last scanned place
 5  b ← 1;
 6  while ll < n do
 7      (ism, mDist, ts, te) ← IsMatch(q, C, b, ll, mDist);
 8      if ism then
 9          for each word w in L_b.ψ do C[w] ← C[w] − 1;
10          b ← b + 1; continue;
11      ll ← ll + 1;
12      if Dist(q, L_ll) > ξ then
13          b ← ll + 1;
14          C ← 0; // for all elements of C
15          continue;
16      for each word w in L_ll.ψ do
17          C[w] ← C[w] + 1;
18      if min(Dist(q, L_b), Dist(q, L_ll)) + Σ_{j=b}^{ll−1} Dist(L_j, L_{j+1}) > ξ then
19          for each word w in L_b.ψ do C[w] ← C[w] − 1;
20          b ← b + 1; continue;
21      (ism, mDist, ts, te) ← IsMatch(q, C, b, ll, mDist);
22      if ism then
23          for each word w in L_b.ψ do C[w] ← C[w] − 1;
24          b ← b + 1;
25      if ll = n and not (∀ w ∈ q.ψ, C[w] > 0) then
26          break; // no remaining matches
27  return (mDist, ts, te);



Procedure IsMatch(q, C, ts, te, mDist)

Input: query q, counter vector C, start place ts, end place te, match distance mDist
Result: ism, mDist, ts, te

 1  if ∀ w ∈ q.ψ, C[w] > 0 then // it is a match
 2      md = min(Dist(q, L_ts), Dist(q, L_te)) + Σ_{j=ts}^{te−1} Dist(L_j, L_{j+1});
 3      if mDist > md then mDist ← md;
 4      return (true, mDist, ts, te);
 5  return (false);



$\max_{j \in [s, e]} dist(q, TR.L_j)$, is a lower bound of the match distance
between the sub-trajectory and q.

Proof Sketch: The proof can be established based on the triangle
inequality. □

The pseudocode of the algorithm is outlined in Procedure Match.
The algorithm takes in three arguments: a query q, a trajectory TR,
and the match distance of the current kth result. It uses a variable
mDist to keep track of the current minimum match distance, and
ts and te to track the start place and end place, respectively, of the
corresponding minimum match (line 1). It uses an array C to keep
track of the number of occurrences of query keywords (in query
q) in a sub-trajectory (line 2). It uses a variable b to represent the
start place of a sub-trajectory, and a variable ll to represent the end
place of a sub-trajectory. The algorithm initializes array C with the
occurrences of query keywords in location $L_1$ (line 3).

For each place $L_{ll}$, Procedure Match searches for a match for
sub-trajectories from $L_b$ to $L_{ll}$ (lines 7-24). Procedure Match scans
places to the right of $L_b$ to see whether a sub-trajectory starting
from $L_b$ exists to match query q (lines 7-17). During the scan,
Match updates the counter for each query keyword when it encounters
a new place $L_{ll}$ (lines 9-10).

Based on the minimum match distance ξ of the current kth result,
we develop two pruning strategies.

Pruning 1: If we encounter a place $L_{ll}$ (line 12) such that the
distance $Dist(L_{ll}, q)$ is larger than the minimum match distance ξ of
the current kth result, any sub-trajectory containing $L_{ll}$ cannot be
a result according to Lemma 4. Hence, any sub-trajectory starting
from a place between the current $L_b$ and $L_{ll}$ cannot be a top-k result
and we skip to the next point $L_{ll+1}$ to search for sub-trajectories
starting from $L_{ll+1}$ (line 15), which is equivalent to invoking
procedure Match to find the minimum match for sub-trajectories
starting from $L_{ll+1}$.

Pruning 2: If $\min(Dist(q, L_b), Dist(q, L_{ll})) + \sum_{j=b}^{ll-1} Dist(L_j, L_{j+1})$
is larger than the match distance ξ, sub-trajectories starting from b
to the right cannot be a top-k result (due to the triangle inequality).
Hence, we move to the next start place $L_{b+1}$ (line 20).

The algorithm invokes procedure IsMatch (to be explained shortly)
to check whether a sub-trajectory starting from $L_b$ and ending at $L_{ll}$
is a match of query q and to compute the match distance for a match
(lines 7 and 21).

Pruning 3: If we find a match, we stop scanning further to the
right (lines 10 and 24). This is because the sub-trajectories
generated by further scanning contain the match sub-trajectory from
$L_b$ to $L_{ll}$, and thus will have a larger match distance than that of the
current one according to Proposition 2.

If we find a match, we eliminate the contribution of place $L_b$
from C by reducing the counter C[w] by 1 if word w appears in
$L_b.\psi$ (lines 9 and 23). After the elimination, C only records the
frequencies of query keywords in the sub-trajectory from $L_{b+1}$ to $L_{ll}$.
This enables us to reuse the computation at $L_b$ to search for a matching
sub-trajectory starting at $L_{b+1}$. Any sub-trajectory between $L_{b+1}$
and $L_{ll-1}$ must not be a match since it is contained by the
sub-trajectory from $L_b$ to $L_{ll-1}$, which is not a match. Hence, to find a
match sub-trajectory starting from $L_{b+1}$, we do not need to check
these sub-trajectories. Instead, we check the sub-trajectories starting
from $L_{b+1}$ and ending at $L_{ll}$ (line 7), and beyond if required
(lines 11-24). In other words, we only need to scan from location
$L_{ll}$, rather than the start location $L_{b+1}$, due to reusing the
computation at $L_b$.

Pruning 4: If the sub-trajectory from $L_b$ to the last place $L_n$
cannot match query q, the algorithm terminates (lines 25-26) since any
sub-trajectory of the sub-trajectory from $L_b$ to $L_n$ cannot match q
according to Proposition 1.

We proceed to present Procedure IsMatch. If every query keyword
in q is included in the sub-trajectory from ts to te (line 1), the
sub-trajectory matches query q, and the procedure computes the
match distance md (line 2), and updates mDist with md (line 3).

The correctness of the algorithm is obvious: if there exists a minimum
match in TR for query q, the match must start with a place
in TR. Our algorithm is able to find the minimum match starting
from each place, and thus is able to find a minimum match if there
is one.

Complexity: Procedure Match is a linear time algorithm, and its
complexity is $O(|TR|)$. Note that the words of each location are
processed twice at most (once as the end of a sub-trajectory and
once as the head). Two tricks in procedure Match are essential
to achieve the linear complexity: 1) we divide the task into
sub-problems of finding the minimum match starting from each place;
and 2) we are able to reuse the computation for the sub-problem in
the preceding place.

5. EXPERIMENTS 
We conduct extensive experiments on real trajectory datasets to
study the performance of the proposed index B^ck-tree for answering
TkSK queries. We build our proposed index in BerkeleyDB.
In the following experiments, our approach is compared with four
baseline algorithms, including IF, RT, IRT and ITB-tree. ITB-tree
is presented in Appendix B.

5.1 Experimental Settings 

We crawl three real spatial trajectory datasets, located in the US,
France and Germany, respectively, from online travel route sharing
web sites (see footnotes 1 and 2). In the US dataset, there are 12,832 trajectories
and each trajectory contains around 60 locations. The France dataset
contains 27,689 trajectories and each trajectory contains around 78
locations. The Germany dataset contains 40,000 trajectories and
each trajectory contains an average of 40 locations. We use a real
question and answer dataset to attach text to the locations in each
trajectory. The dataset is publicly available from Yahoo! Webscope
and contains 3,895,298 questions and their answers (Q&As), written
in English. Datasets France and Germany are generated by randomly
selecting a question for a location in the France and Germany
trajectories. For the US dataset, we attach both a question
and its answer to the locations in the US data. That is, the
trajectories in the US dataset are associated with many more keywords
than those in the other two datasets.

In addition to the data from online route sharing web sites, we
also generate a real trajectory dataset from Flickr. We retrieve photos
in New York City with shooting time, geo-location and descriptive
tags from the same user and use them to generate trajectories
based on the approach in [19]. This dataset contains 19,104 trajectories
and each trajectory contains around 4 locations.

The detailed statistics of the generated datasets are given in
Table 1.





                  US           France       Germany      Flickr
#traj             12,832       27,689       40,000       19,104
#location         760,516      1,608,412    1,314,243    55,059
#word             26,792,407   9,098,284    5,620,720    2,654,477
#distinct-word    452,734      244,779      164,882      58,917

Table 1: Datasets statistics

In order to evaluate the scalability, we also generate datasets with
different numbers of trajectories and different numbers of locations
per trajectory by sampling the Germany dataset. The number of
trajectories increases from 10K to 40K and the number of locations
in each trajectory increases from 50 to 200, respectively. We list the
settings in Table 2, where the default values are shown in bold.



Parameter                        Setting
Datasets                         US, FR, GM, Flickr
# of queries                     50
k in TkSK query                  5, 10, 15, 20, 25
# of keywords in TkSK query      2, 3, 4, 5
# of segments per quad cell      400, 600, 800, 1000, 1200

Table 2: Experimental parameters and settings

As shown in Table 2, for each of the datasets we randomly generate
a set of 50 queries and we report the average running time.
I/O cost is not reported in the experiments because inverted file,
R-tree and BerkeleyDB have different file I/O mechanisms and it
is difficult to find an appropriate and fair comparison method in
terms of I/O cost. In the experiments, we vary the number k in the
TkSK query from 5 to 25. To study the effect of the number of
query keywords, we vary it from 2 to 5. Recall that our indexing
approach relies on a grid partitioning of the spatial space. We also
investigate the performance implications of different partitioning
granularities. In particular, we vary the number limit of trajectory
segments per cell from 400 to 1200. All the algorithms including
the baselines are implemented in Java and run on a server installed
with the CentOS operating system.

1 www.bikely.com
2 www.gpsies.com

5.2 Query Performance 

5.2.1 Effect of k in TkSK queries

In the first set of experiments, we fix the number of query keywords
at 3 and study the effect of k in the top-k queries. We plot the
average running time on the four real datasets in Figure 4. We notice
that ITB incurs much higher cost than the other indexes. For instance,
in the US dataset, the running time of ITB is about 3-6 times
higher than IF and more than 10 times higher than our approach using
the B^ck-tree. ITB's relatively low performance is attributed to two
reasons. First, ITB indexes locations rather than trajectories. Second,
a leaf node in ITB only contains the locations from the same
trajectory. Hence, the ITB index contains many more nodes than
do the other indexes. In order to make the figures more presentable,
we do not present the results of ITB in the figures in this section.

Figure 4 shows that our indexing approach significantly outperforms
the other three baseline approaches in all datasets. Note that the
y-axes are in logarithmic scale. B^ck-tree is usually around 1-2 times
faster than IRT, the best baseline among the four baselines. Since
IF finds all the trajectories that match the query, the running time
remains constant for all values of k. The other three methods, on
the other hand, incur higher cost as k increases. This is expected
since they use the match distance of the kth trajectory as the pruning
condition. We observe that IRT performs better than RT on
datasets US, GM and FR while IRT and RT perform almost the
same on dataset Flickr. IRT uses the IR-tree [10] to prune the search
space utilizing both spatial information and text information. IRT
is effective on US, GM and FR, in which trajectories are distributed
over a whole country, and thus the overlap among the MBRs of
trajectories is relatively small although IRT takes a whole trajectory
as an object. On the three datasets, RT is worse than IRT since RT
is based on the R-tree and only uses spatial information to prune the
search space. However, trajectory data from Flickr is from a city,
and simply treating a whole trajectory as an object yields very high
overlap between MBRs and thus degrades the pruning power of the
text information of the IR-tree used in the IRT algorithm. The overlap
between MBRs also explains why RT performs poorly on dataset
Flickr.

5.2.2 Effect of the number of query keywords 

Next, we study the query performance when varying the number
of query keywords from 2 to 5. The results are presented in
Figure 5. The y-axes are also in logarithmic scale. Again, our approach
provides results with the best running time over all the three
datasets, and it runs 1-2 times faster than the best baseline, IRT. For
IF, we observe that the more keywords are queried, the faster the
results are returned. This is because IF has more query keywords
to do the filtering, and IF computes the match distance for fewer
trajectories that cover the query keywords. For the other tree-based
approaches, more query keywords require more I/O cost to read the
posting lists, and thus the running time increases slightly.
Figure 4: Varying k in TkSK queries
Figure 5: Varying the number of query keywords

5.2.3 Effect of partition granularity 

We now proceed to study the query performance of the proposed
index with regard to the partition granularity. Recall that we set a
limit for the number of trajectory segments in each cell. A cell splits
into 4 sub-cells when the number of segments exceeds the limit.
In this experiment, we vary the number limit from 400 to 1200.
The results of running time and I/O cost are shown in Figure 6.
From the figure, we can conclude that our approach is not sensitive
to the partition granularity. With finer partition, the performance
slightly degrades because more cells are scanned but few additional
trajectories are pruned. However, this performance degradation is
so small that it is negligible. In particular, when varying the limit
from 400 to 1200, the running time degradation is only 0.03s.
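
The splitting rule can be sketched as a recursive partition. The sketch below is an illustration under simplifying assumptions: it counts points rather than trajectory segments, represents a cell as a hypothetical (x1, y1, x2, y2) tuple, and adds a depth cap (not mentioned in the text) so that coincident points cannot cause unbounded recursion.

```python
def split_cells(cell, points, limit, max_depth=16):
    """Split a cell into 4 sub-cells while it holds more than `limit` points.

    Returns the list of leaf cells as (cell, points_in_cell) pairs.
    """
    x1, y1, x2, y2 = cell
    if len(points) <= limit or max_depth == 0:
        return [(cell, points)]
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Children in the order: lower-left, lower-right, upper-left, upper-right.
    children = [((x1, y1, mx, my), []), ((mx, y1, x2, my), []),
                ((x1, my, mx, y2), []), ((mx, my, x2, y2), [])]
    for p in points:
        idx = (1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)
        children[idx][1].append(p)
    leaves = []
    for child_cell, child_points in children:
        leaves.extend(split_cells(child_cell, child_points, limit, max_depth - 1))
    return leaves
```

A smaller `limit` yields more and smaller leaf cells, which corresponds to the finer partitions examined in this experiment.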
Figure 6: Varying the number limit of trajectory segments in a
cell



5.2.4 Scalability 

Finally, we evaluate the scalability. In this experiment, we report
two sets of results. In the first set, we fix the number of locations in
each trajectory at 50 and vary the number of trajectories from 10K
to 40K. In the second one, we use one dataset with 20K trajectories
and vary the number of locations in each trajectory from 50 to
200. The running times are shown in Figure 7. As expected, all
of the four methods take linear/sublinear time. We also notice that
the proposed method B^ck-tree scales much better than do the other
methods when increasing the number of trajectories.
6. RELATED WORK

Trajectory Query 

To the best of our knowledge, no work has considered answering 
the proposed TkSK queries for trajectory data. 

Related to the TkSK queries are the keyword range queries supported
in some online GPS trajectory sharing applications, e.g.,
Mountain Bike (www.bikly.com), GPS sharing, etc., in which users
can share, browse and search GPS trajectories. They allow users to
specify a region and a set of keywords, and return the trajectories
that are inside the query region and contain the set of query keywords.
However, the algorithms used are not publicized.

Existing work on spatial-temporal trajectory indexing schemes [6,
11, 22, 26, 27] clearly focuses on trajectories without text data. These
index structures are usually designed for keeping track of moving
objects. A number of algorithms have been proposed to process
different types of spatial-temporal queries, such as k nearest neighbor
queries (e.g., finding the k-closest objects with respect to a
given point at a given time), range queries (e.g., finding all objects
within a given area), and complex spatial pattern queries [9, 15, 28].

A number of similarity functions and algorithms have been developed
to compute the similarity between trajectories/time series
data, e.g., [3, 7, 29]. Also, there exists work on trajectory pattern
discovery [20], clustering trajectories [18], and finding significant
locations from trajectories [5].
Spatial Keyword Search 

Zhou et al. [32] handle the problem of retrieving web documents
relevant to a keyword query within a pre-specified spatial region.
A similar problem is also considered by Chen et al. [8] and Hariharan
et al. [17]. These proposals use loose combinations of an inverted
file and a spatial index (e.g., R-tree). The query processing in these
proposals occurs in two stages: one type of indexing (e.g., inverted
list) is used to filter web documents in the first stage, and then the
other index (e.g., R-tree) is employed, or vice versa. This kind of index
has the disadvantage that it cannot simultaneously prune the search
space using both keywords and spatial distance.

Felipe et al. [12] propose a novel index structure called IR²-tree
that augments an R-tree with signatures. For the first time, the
new hybrid index structure makes it possible to utilize both spatial
information and text information to prune the search space at query time, which
advances the state of the art in spatial-keyword query processing.
However, this proposal suffers from the crucial limits of signature
files (e.g., the number of false matches is linear in the collection
size [33]). Further, the IR²-tree faces the challenge of whether the
signatures possess enough pruning power to offset the extra cost
incurred by the taller trees that result from the inclusion of signatures.

The hybrid index structure that combines R*-tree and bitmap indexing
is developed to process a new query called m-closest keyword
query [30] that returns the closest objects containing at least
m keywords. This index structure exhibits the same problems as does
signature-file based indexing [12].

The hybrid index structure IR-tree [10], which integrates the R-tree 
and the inverted file, enables efficient processing of the 
location-aware top-k ranking query by using both location and text 
information to prune the search space. In the IR-tree [10], the fanout 
of the tree is independent of the number of words of objects in the 
dataset, and, during query processing, only the (few) posting lists 
relevant to the query keywords need to be fetched. A recently proposed 
index named Spatial Inverted Index [24] maps each keyword to a 
distinct aggregated R-tree [21] that stores the objects containing 
the given keyword. The collective spatial keyword query [5] aims 
to retrieve a group of nearby objects that cover the query keywords. 

None of these proposals considers trajectory data associated with 
text, as this paper does. Moreover, these hybrid index solutions 
are not supported by mainstream DBMSs. In contrast, the solution 
proposed in this paper can be implemented readily on existing DBMSs. 

Finally, note that the proposed TCSK query is complementary 
to route planning queries (e.g., [25]), which return a route of 
places from a spatial database such that the route covers a set of 
query keywords and the travel distance is minimized. 

7. CONCLUSION 

This paper proposes a new algorithm IE for efficiently answering 
TCSK queries on trajectory data associated with text descriptions. 
The algorithm is developed based on a new hybrid index called the 
cell-keyword conscious B+-tree, denoted by B^ck-tree. The B^ck-tree allows 
us to develop algorithms that exploit both text relevance and location 
proximity to facilitate efficient and effective query processing. 
Additionally, the algorithm Match is proposed for efficiently com- 
puting the match distance between a query and a trajectory. The 
experimental results demonstrate that the proposed algorithm out- 
performs several baseline algorithms significantly and offers good 
scalability. 
APPENDIX 

A. DISCUSSION ON UPDATES 

Deletions and replacements of trajectories seldom happen in a trajectory 
repository. Thus we only consider inserting new trajectories. To insert a 
new trajectory, the insertion algorithm first locates the cells into which 
the trajectory falls. It then checks whether the number of trajectory 
segments in each of the located cells is still within the limit. If so, 
the algorithm associates the words of the trajectory with the corresponding 
cells according to the word-cell association method discussed earlier and 
inserts the entries <wordID, cellID, tID> into the B+-tree. A cell in which 
the number of trajectory segments exceeds the limit after insertion is split 
into four sub-cells, and the word-cell association is recomputed. The 
algorithm then inserts into the B+-tree the entries for the four newly 
created sub-cells and removes the obsolete entries associated with the old 
cell. 
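
The insertion procedure can be sketched in Python as follows. This is a simplified illustration rather than our implementation, and it rests on several assumptions that are not part of the paper: the class `GridIndex` and its `max_depth` guard are hypothetical, the cell limit counts places instead of trajectory segments, each word is simply associated with the cell that contains its place, and a sorted Python list stands in for the B+-tree.

```python
import bisect

class GridIndex:
    """Quad-split grid whose (word, cell_id, t_id) keys are kept sorted."""

    def __init__(self, bbox, limit, max_depth=8):
        self.limit = limit              # max places per cell (assumption)
        self.max_depth = max_depth      # guard against endless splitting
        self.cells = {"0": (bbox, [])}  # leaf cell id -> (bbox, places)
        self.entries = []               # sorted list standing in for the B+-tree

    def _locate(self, point):
        x, y = point
        for cell_id, ((x0, y0, x1, y1), _) in self.cells.items():
            if x0 <= x < x1 and y0 <= y < y1:  # half-open cells
                return cell_id
        raise ValueError("point lies outside the indexed space")

    def _add_entries(self, cell_id, t_id, words):
        for word in words:
            key = (word, cell_id, t_id)
            pos = bisect.bisect_left(self.entries, key)
            if pos == len(self.entries) or self.entries[pos] != key:
                self.entries.insert(pos, key)

    def _split(self, cell_id):
        (x0, y0, x1, y1), places = self.cells.pop(cell_id)
        # Remove the obsolete entries associated with the old cell.
        self.entries = [e for e in self.entries if e[1] != cell_id]
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                     (x0, ym, xm, y1), (xm, ym, x1, y1)]
        for i, quad in enumerate(quadrants):
            self.cells[cell_id + str(i)] = (quad, [])
        # Recompute the word-cell association for the four sub-cells.
        for t_id, point, words in places:
            self._place(t_id, point, words)

    def _place(self, t_id, point, words):
        cell_id = self._locate(point)
        places = self.cells[cell_id][1]
        places.append((t_id, point, words))
        if len(places) <= self.limit or len(cell_id) > self.max_depth:
            self._add_entries(cell_id, t_id, words)
        else:
            self._split(cell_id)

    def insert_trajectory(self, t_id, places):
        for point, words in places:
            self._place(t_id, point, words)

index = GridIndex(bbox=(0, 0, 8, 8), limit=2)
index.insert_trajectory("t1", [((1, 1), {"cafe"}), ((5, 5), {"park"})])
index.insert_trajectory("t2", [((1, 2), {"cafe", "wifi"})])  # triggers a split
```

Inserting `t2` pushes the root cell over its limit, so the cell is replaced by four sub-cells and every entry that referred to it is rewritten with the id of the sub-cell that now contains the place.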

B. BASELINE 4: ITB-TREE INDEX BASED 
ALGORITHM 

The baselines TR and IRT treat each trajectory as an object when building 
the index. The ITB-tree index instead treats each location of a trajectory, 
rather than the whole trajectory, as an object. 

We proceed to briefly present an index structure, the ITB-tree (Inverted 
file augmented TB-tree), and the idea of an algorithm based on the ITB-tree 
for the KSK query. The ITB-tree is essentially a TB-tree [22] augmented 
with inverted files. The TB-tree [22] was proposed for indexing trajectory 
data without text information to efficiently support location-based queries. 
The ITB-tree inherits from the TB-tree [22] the ability to preserve 
consecutive locations of the same trajectory in the index. 

Each leaf node in the ITB-tree contains entries of the form e = (Λ, ψ), 
where e represents a place of a trajectory in dataset D, e.Λ is the minimum 
bounding rectangle (MBR), which is a point for a place, and e.ψ refers 
to the id of the text description of the place. Each leaf node contains a 
pointer to an inverted file with the text descriptions of the objects stored in 
the node. In addition, each leaf node maintains two pointers (forward and 
backward) that link the leaf node to other leaf nodes that contain adjacent 
sub-trajectories of the sub-trajectory contained in the leaf node. 

Each non-leaf node N in the ITB-tree contains a number of entries of 
the form (e, Λ, ψ), where e is the address of a child node of N, Λ is the MBR 
of all rectangles in entries of the child node, and ψ is the identifier of a 
pseudo text description that is the union of all text descriptions in the 
entries of the child node. Each non-leaf node also contains a pointer to 
an inverted file with the text descriptions of the entries stored in the node. 
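
The node layout can be summarized by the following Python sketch. It is only a schematic of the description above, under stated assumptions: the class and function names are hypothetical, word sets are stored inline instead of text-description ids, and the inverted file is computed on demand instead of being reached through a pointer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple, Union

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def union_rect(rects: List[Rect]) -> Rect:
    """Smallest rectangle enclosing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

@dataclass(eq=False)
class LeafNode:
    places: List[Tuple[Point, Set[str]]]  # consecutive places of one trajectory
    prev: Optional["LeafNode"] = None     # backward pointer (same trajectory)
    next: Optional["LeafNode"] = None     # forward pointer (same trajectory)

    def entry_mbrs(self) -> List[Rect]:
        # The MBR of a place degenerates to a point.
        return [(x, y, x, y) for (x, y), _ in self.places]

    def entry_texts(self) -> List[Set[str]]:
        return [words for _, words in self.places]

@dataclass(eq=False)
class NonLeafNode:
    children: List[Union[LeafNode, "NonLeafNode"]]

    def entry_mbrs(self) -> List[Rect]:
        # One entry per child: the MBR of all rectangles in the child.
        return [union_rect(c.entry_mbrs()) for c in self.children]

    def entry_texts(self) -> List[Set[str]]:
        # Pseudo text description: union of the child's text descriptions.
        return [set().union(*c.entry_texts()) for c in self.children]

def inverted_file(node) -> Dict[str, List[int]]:
    """Map each word to the positions of the node's entries containing it."""
    inv: Dict[str, List[int]] = {}
    for i, words in enumerate(node.entry_texts()):
        for w in sorted(words):
            inv.setdefault(w, []).append(i)
    return inv

leaf_a = LeafNode([((1.0, 1.0), {"cafe"}), ((2.0, 3.0), {"park"})])
leaf_b = LeafNode([((4.0, 2.0), {"cafe", "wifi"})])
leaf_a.next, leaf_b.prev = leaf_b, leaf_a  # adjacent sub-trajectories
root = NonLeafNode([leaf_a, leaf_b])
```

Because the pseudo text description of a non-leaf entry is a union, a keyword absent from `inverted_file(root)` is absent from the whole subtree, which is what allows a subtree to be skipped on textual grounds.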

We treat query q as a set of partial queries, where each partial query has 
one keyword in q.ψ and the spatial component q.λ. For each partial query, 
we find its nearest places incrementally using the ITB-tree index. When a 
trajectory is covered by all the partial queries, i.e., some place in the 
trajectory has been retrieved as a nearby place for each partial query, we 
choose the trajectory as a candidate and compute the match distance of the 
trajectory to the query. Intuitively, the trajectory is a good candidate for 
the top-k results since it contains all the query keywords and its places 
covering the keywords are close to the spatial component of the query. The 
detailed pseudocode of the partial query evaluation algorithm can be found in 
our technical report. 
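
The candidate generation idea can be sketched in Python as follows. This is not the pseudocode from the technical report: the function `candidate_trajectories` is hypothetical, a per-keyword heap of places ordered by Euclidean distance stands in for incremental nearest-neighbor search on the ITB-tree, the partial queries are advanced round-robin, and the computation of the match distance is left out.

```python
import heapq
import math

def candidate_trajectories(trajectories, q_loc, q_words):
    """Yield trajectory ids in the order they become covered by all
    partial queries. trajectories: t_id -> list of (point, words)."""
    # One stream of (distance, t_id) per query keyword, nearest place first.
    # (Assumption: a heap replaces incremental search on the ITB-tree.)
    streams = {}
    for w in q_words:
        stream = [(math.dist(q_loc, p), t_id)
                  for t_id, places in trajectories.items()
                  for p, words in places if w in words]
        heapq.heapify(stream)
        streams[w] = stream

    seen = {}        # t_id -> keywords whose stream has returned the trajectory
    emitted = set()
    while any(streams.values()):
        for w, stream in streams.items():  # advance each partial query once
            if not stream:
                continue
            _, t_id = heapq.heappop(stream)
            seen.setdefault(t_id, set()).add(w)
            if len(seen[t_id]) == len(streams) and t_id not in emitted:
                emitted.add(t_id)
                yield t_id  # covered by every partial query: a candidate

trajectories = {
    "t1": [((1, 0), {"cafe"}), ((0, 2), {"park"})],
    "t2": [((5, 0), {"cafe", "park"})],
    "t3": [((0, 1), {"cafe"})],
}
candidates = list(candidate_trajectories(trajectories, (0, 0), ["cafe", "park"]))
```

In this example `t1` becomes a candidate after a single round because both partial queries retrieve one of its places first, `t2` follows once its more distant place has been reached by both streams, and `t3` is never reported because no place of it contains "park".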